Purpose and Goals

The purpose of this code is to perform exploratory data analysis on the mortality data in the nc2016.csv data set that I prepared in the data aggregation project. (Please see the 2016Mort_dataAggr.html file to view the data aggregation project.)
My goals for this project are to check linearity, normality, analyze the distribution of the data, discover any correlations and check for collinearity and multicollinearity.


How to read this document

Before we begin the analyses here is a brief explanation of the layout of this document and what each of the parts are.
This page displays code, output, and discussions about the output. The layout is as follows:
1. Code is displayed in gray “code boxes”

# This is a code box
x <- c(1,3,2,4,8,4,7)
y <- c(9,3,1,5,7,2,6)
plot(x, y, col='#04918a', main="Output of code", type = "p", pch = 19)
  1. Output of the code
  2. Discussion about results/findings


Bring in the data set

nc2016 <- read.csv("./output/nc2016.csv")
nc2016


Descriptive statistics for a few variables of interest

stargazer(nc2016, type = 'html',
          keep = c("mentUnhDays", "perAccToExer", "perExcDrink", "Unemploy", "perUnemploy", "perInsSleep", "foodEnvIndx", "chlamyd", "perChlamyd", "tBirthRate", "violenCrime", "vioCrimeRate", "ageAdjRate"),
          covariate.labels = c("Mentally Unhealthy Days", "Access to exercise opportunities", "Excessive drinking", "Unemployment", "Unemployment Rate", "Percent Insufficient Sleep", "Food environment index", "Chlamydia Cases", "Chlamydia Rate", "Teen Birth Rate", "Violent Crimes", "Violent Crime Rate", "Aged Adjusted Mortality Rate"),
          title = "Descriptive Statistics")
Descriptive Statistics
Statistic N Mean St. Dev. Min Max
Mentally Unhealthy Days 100 4.211 0.299 3.368 5.400
Access to exercise opportunities 100 68.250 18.950 19.330 100.000
Excessive drinking 100 16.050 1.894 12.130 22.620
Unemployment 100 2,464.000 3,931.000 128 27,415
Unemployment Rate 100 5.640 1.184 3.844 9.406
Percent Insufficient Sleep 100 33.660 2.771 28.600 39.570
Food environment index 98 7.202 0.870 4.100 8.500
Chlamydia Cases 100 643.800 1,271.000 10 9,276
Chlamydia Rate 100 0.005 0.003 0.001 0.012
Teen Birth Rate 100 0.034 0.011 0.006 0.061
Violent Crimes 95 341.100 686.700 6.333 5,376.000
Violent Crime Rate 95 0.003 0.001 0.001 0.007
Aged Adjusted Mortality Rate 100 8.265 0.974 5.368 10.500


Histograms and Boxplots

par(mfrow=c(1,2))
hist(nc2016$ageAdjRate, main = "Age Adjusted Mortality Rate", xlab = "Rate", col = "purple")

boxplot(nc2016$ageAdjRate, horizontal = TRUE, main = "Age Adjusted Mortality Rate", xlab = "Rate", col = "purple")


We can see from the above histogram and boxplot that the Age Adjusted Mortality Rate variable is distributed close to normally and so it will be fine as it is as a dependent (response) variable.

par(mfrow=c(2,2))
hist(nc2016$mentUnhDays, main = "Mentally Unhealthy Days", xlab = "Days", col = "purple")
hist(nc2016$perInsSleep, main = "Percent Insufficient Sleep", xlab = "Percent with Insufficient Sleep", col = "purple")

boxplot(nc2016$mentUnhDays, horizontal = TRUE, main = "Mentally Unhealthy Days", xlab = "Days", col = "purple")
boxplot(nc2016$perInsSleep, horizontal = TRUE, main = "Percent Insufficient Sleep", xlab = "Percent with Insufficient Sleep", col = "purple")


Both the above variables appear to be normally distributed, however Mentally Unhealthy Days does have a couple outliers to consider.

par(mfrow=c(2,2))
hist(nc2016$perAccToExer, main = "Access to Exercise Opportunities", xlab = "Percent with access", col = "purple")
hist(nc2016$perExcDrink, main = "Excessive Drinking", xlab = "Percent excessive drinking", col = "purple")

boxplot(nc2016$perAccToExer, horizontal = TRUE, main = "Access to Exercise Opportunities", xlab = "Percent with access", col = "purple")
boxplot(nc2016$perExcDrink, horizontal = TRUE, main = "Excessive Drinking", xlab = "Percent excessive drinking", col = "purple")


Access to Exercise Opportunities is slightly skewed left while Excessive Drinking is skewed right with an outlier.

par(mfrow=c(2,2))
hist(nc2016$foodEnvIndx, main = "Food Environment Index", xlab = "Index", col = "purple")
hist(nc2016$tBirthRate, main = "Teen Birth Rate", xlab = "Rate", col = "purple")

boxplot(nc2016$foodEnvIndx, horizontal = TRUE, main = "Food Environment Index", xlab = "Index", col = "purple")
boxplot(nc2016$tBirthRate, horizontal = TRUE, main = "Teen Birth Rate", xlab = "Rate", col = "purple")


Food Environment Index is skewed left and Teen Birth Rate is mostly normally distributed.

par(mfrow=c(2,2))
hist(nc2016$violenCrime, main = "Violent Crimes", xlab = "Number of violent crimes", col = "purple")
hist(nc2016$vioCrimeRate, main = "Violent Crime Rate", xlab = "Rate", col = "purple")

boxplot(nc2016$violenCrime, horizontal = TRUE, main = "Violent Crimes", xlab = "Number of violent crimes", col = "purple")
boxplot(nc2016$vioCrimeRate, horizontal = TRUE, main = "Violent Crime Rate", xlab = "Rate", col = "purple")


Both of these variables are skewed right but the Violent Crime Rate is not as strongly skewed. I will include Violent Crime Rate in my analyses going forward but not Violent Crimes.

par(mfrow=c(2,2))
hist(nc2016$chlamyd, main = "Chlamydia Cases", xlab = "Number of violent crimes", col = "purple")
hist(nc2016$perChlamyd, main = "Chlamydia Rate", xlab = "Percent with STI", col = "purple")

boxplot(nc2016$chlamyd, horizontal = TRUE, main = "Chlamydia Cases", xlab = "Number of violent crimes", col = "purple")
boxplot(nc2016$perChlamyd, horizontal = TRUE, main = "Chlamydia Rate", xlab = "Percent with STI", col = "purple")


Again both of these variables are skewed right but the Chlamydia Rate is not as strongly skewed. I will include Chlamydia Rate in my analyses going forward but not Chlamydia Cases.

par(mfrow=c(2,2))
hist(nc2016$Unemploy, main = "Number Unemployed", xlab = "Rate", col = "purple")
hist(nc2016$perUnemploy, main = "Percent Unemployed", xlab = "Rate", col = "purple")

boxplot(nc2016$Unemploy, horizontal = TRUE, main = "Number Unemployed", xlab = "Rate", col = "purple")
boxplot(nc2016$perUnemploy, horizontal = TRUE, main = "Percent Unemployed", xlab = "Rate", col = "purple")


And here too we see that both variables are right skewed but the Percent Unemployed is less skewed so I will consider only the Percent Unemployed in my future analyses.

Checks for Linear Relationship

feats <- c("ageAdjRate", "mentUnhDays", "perInsSleep", "perAccToExer", "perExcDrink", "tBirthRate", "vioCrimeRate", "perChlamyd", "perUnemploy")
pairs(nc2016[ ,feats], col = "purple")


Displayed above is a matrix of scatterplots between the variables in which I am interested. When looking at ageAdjRate we see an increasing linear trend with a majority of the covariates. ageAdjRate and perAccToExer appears to have little to no trend, possibly a slight negative linear trend. And ageAdjRate and perExcDrink has a negative linear trend. There does not appear to be a strong correlation between ageAdjRate and the covariates except for tBirthRate which appears to be moderately correlated.

Checks for Multicollinearity

m_noNA <- nc2016 %>% mutate_if(is.numeric, ~replace(., is.na(.), 0))
feats <- c("ageAdjRate", "mentUnhDays", "perAccToExer", "perExcDrink", "tBirthRate", "vioCrimeRate", "perInsSleep", "perChlamyd", "perUnemploy")
corr <- cor(m_noNA[ , feats])
corrplot(corr, type = "upper", tl.cex = 1.15, tl.col = "black")


In the above correlation plot/heatmap we see that tBirthRate has the strongest positive correlation with ageAdjRate (~0.6) and perExcDrink has the strongest negative correlation with ageAdjRate (~-0.4). We do not see any highly correlated covariates (correlation > 0.9).

m_noNA <- nc2016 %>% mutate_if(is.numeric, ~replace(., is.na(.), 0))
feats <- c("ageAdjRate", "mentUnhDays", "perAccToExer", "perExcDrink", "tBirthRate", "vioCrimeRate", "perInsSleep", "perChlamyd", "perUnemploy")
corr <- cor(m_noNA[ , feats])
upper <- corr
upper[upper.tri(corr)] <- ""
upper <- as.data.frame(upper)
upper


In the above table we can see the numerical correlation value of the covariates. We observe that all the values are less than 0.9 and therefore may conclude that there is no collinearity or multicollinearity.

Another view using ggpairs function from the GGally package

ggpairs(nc2016[ ,feats],
        upper = list(continuous = wrap(ggally_cor, stars = F, size = 5)),
        lower = list(continuous = wrap(ggally_points, col='#005F61')),
        diag = list(continuous = wrap(ggally_densityDiag, col = '#6600DB')))


In this view we see the scatterplots for ageAdjRate along the first column with the associated correlation coefficients along the first/top row. Along the diagonal are the density curves of each of the variables which displays the distribution of the variables in the data set.


Next steps

I will continue my analysis with some statistical modeling and tests in the file 2016Mort_models_tests.html.